For all problem sets, we will provide you with a single R
Markdown file. Please complete the problem set within a local copy of
this file. To turn in, please upload a fully knitted html version. Make
sure to keep echo=TRUE, as appropriate, to show your
coding.
In Problem Set 6 you were introduced to a stunning dataset of count data for the Angeleno Quail (Callipepla curtii) collected in 2015. As a reminder, surveyors visited 267 survey points three times during the breeding season, conducting targeted surveys for the quail.
Last time, you fit a mixed-effects model to the data and found strong effects of elevation, forest cover, an interaction, and wind. You noticed, however, that your dataset included a lot of zeroes, but you didn’t let that disturb you! After this week’s lessons, however, you’ve become a bit more worried, and you wish to try fitting the data in a way that accounts for imperfect detection. Since you have count data, you decide to use an N-mixture model. Like an occupancy model, but for abundance data, the N-mixture model separates the underlying abundance state process (here, the factors that determine quail abundance) from the observation process (here, how imperfect detection reduces the number of quail actually seen), and models the data as a mixture of these two processes.
The biologists who surveyed the species tell you that they believe forest cover and elevation likely influence quail abundance, while wind is a very strong factor in their ability to detect quail (but, as a variable that changes by the day, wind doesn’t influence true underlying abundance). The biologists are unsure whether canopy cover (which correlates with ground cover, as well as with auditory transmission) impacts detection probability, so it’s probably safer to test for it. You use this natural history knowledge to guide your model parameterization.
The data are identical to those used for Problem Set 6, so re-load the R data object.
load("ProblemSet6.Rdata")
Your goal is to run a single N-mixture model in JAGS and learn what you can about the species from the fitted model. The model should include all the major components of this week’s problem set: the effects of elevation, forest cover, their interaction, and wind. Note, however, that you do not have to include a random effect for site in this N-mixture model, because N-mixture models, like occupancy models, assume that the true underlying state does not change across repeat visits. Thus the “pseudoreplication” of repeat visits is what informs the detection parameter, while abundance covariates are parameterized on the latent variable, \(N\), which is assumed to be unchanging across visits (an assumption called “closure”). You should realize, consequently, that the true abundance at a site (\(N_j\)) is only estimated once for each of \(j\) sites. Consequently, your model should follow the general structure of an N-mixture model, which is:
\(y_{jk}\sim binomial(p_{jk}, N_j)\)
\(N_j \sim Poisson(\lambda_j)\)
where \(p_{jk}\) is the probability of detection at site \(j\) on visit \(k\), which can be modeled as a function of covariates, and \(\lambda_j\) is the expected true abundance at site \(j\), which can also be modeled as a function of covariates.
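To make the two-level structure concrete, here is a minimal simulation sketch of an N-mixture dataset. The values of `lambda` and `p` are invented for illustration, not estimates from the quail data; only the site and visit counts match the survey design.

```r
# Toy simulation of the two-level N-mixture structure (illustrative values only)
set.seed(1)
n_sites  <- 267   # survey points, as in the quail dataset
n_visits <- 3     # repeat visits within the closed season

lambda <- 2                        # assumed expected true abundance per site
N      <- rpois(n_sites, lambda)   # state process: true abundance, fixed across visits
p      <- 0.4                      # assumed per-individual detection probability

# Observation process: each visit is a binomial draw from the SAME N[j] (closure)
y <- matrix(rbinom(n_sites * n_visits, size = rep(N, n_visits), prob = p),
            nrow = n_sites, ncol = n_visits)

# A count can never exceed the latent abundance at its site
all(y <= N)
```

Note that `N` is drawn once per site while `y` varies by visit; that asymmetry is exactly what the nested model loops below encode.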
Draw a DAG for the model you are about to create. Display your
DAG below using the ggdag R package.
Using your DAG, write out the JAGS model code for this model. Use
appropriate vague priors for all parameters. You will need to create two
nested for-loops: an outer loop over every site j, and an
inner loop over each visit k. Thus, your process model
should be for every count[j,k]. Keep in mind that your
deterministic process and observation models will use different link
functions, which means that a truly ‘vague’ prior for your parameters
may differ depending on whether it is an abundance parameter or a
detection parameter.
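One way to see why “vague” depends on the link function: draws from a `dnorm(0, 0.00001)` prior (JAGS precision parameterization, so sd ≈ 316) are harmless on the log scale, but pushed through the inverse logit they pile essentially all of their mass at 0 and 1. A quick sketch:

```r
# A dnorm(0, 0.00001) prior is flat-ish on the logit scale, but on the
# probability scale it concentrates mass at the extremes
set.seed(42)
logit_draws <- rnorm(10000, mean = 0, sd = sqrt(1 / 0.00001))
p_draws     <- plogis(logit_draws)   # inverse-logit back-transform

# Fraction of prior mass within a hair of 0 or 1
mean(p_draws < 0.01 | p_draws > 0.99)
```

A prior with a logit-scale sd of roughly 1.5 to 2 is much closer to uniform on the probability scale, which is worth keeping in mind for detection parameters.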
qs_mod <- function(){
  # Vague priors for abundance parameters (log link)
  b0 ~ dnorm(0,0.00001)
  b_elev ~ dnorm(0,0.00001)
  b_forest ~ dnorm(0,0.00001)
  b_int ~ dnorm(0,0.00001)
  # Vague priors for detection parameters (logit link)
  alpha_0 ~ dnorm(0,0.00001)
  alpha_wind ~ dnorm(0,0.00001)
  alpha_forest ~ dnorm(0,0.00001)
  # Site-level random effect on abundance (optional for an N-mixture model)
  tau_site ~ dgamma(0.001,0.001)
  sigma_site <- 1 / sqrt(tau_site)
  for (j in 1:site_num){
    # State process: latent abundance, estimated once per site (closure)
    N[j] ~ dpois(lam[j])
    log(lam[j]) <- b0 + b_elev*elev[j] + b_forest*forest[j] +
      b_int*elev[j]*forest[j] + b_site[j]
    b_site[j] ~ dnorm(0,tau_site)
    for (k in 1:visit_num){
      # Observation process: detection varies by site and visit
      logit(p[j,k]) <- alpha_0 + alpha_wind*wind[j,k] + alpha_forest*forest[j]
      count[j,k] ~ dbinom(p[j,k],N[j])
      # Replicated data for posterior predictive checks
      count.new[j,k] ~ dbinom(p[j,k],N[j])
    }
  }
}
# I do not think it would be advantageous to use the findings from last week's analysis as informative priors for the current analysis. The issue is that last week's analysis assumed perfect detection of the individuals within the community, whereas this week we are adding imperfect detection to the model. If we used last week's estimated parameters as priors, those priors might not cover the posterior distributions of this week's parameters. In particular, the posteriors for the abundance covariates, such as b0 and b_forest, are likely to be consistently larger this week than last week's estimates, precisely because imperfect detection is now accounted for.
# Furthermore, we are also removing wind from the abundance covariates; this changes the structure of the model, making last week's priors a poor match for this week's parameters.
If you get a Node inconsistent with parents error from JAGS, this
indicates that you need to give JAGS plausible initial values for the
latent variable, N.
set.seed(2025)
count=ps.data[["count"]]      # 267 x 3 matrix of counts (site x visit)
elevation=ps.data[["elev"]]   # site-level covariate
forest=ps.data[["forest"]]    # site-level covariate
wind=ps.data[["wind"]]        # visit-level covariate (site x visit)
avg_count=apply(count,1,mean)
site_num=nrow(wind)
visit_num=ncol(wind)
site=rep(1:site_num,each=visit_num)
visit=rep(1:visit_num,times=site_num)
total_sample=site_num*visit_num
jags_data_qs<-list(
site_num=site_num,
visit_num=visit_num,
count=count,
wind=wind,
elev=elevation,
forest=forest
)
qs.fit<-jags(data=jags_data_qs,
             parameters.to.save=c("b0","b_elev","b_forest","b_int","tau_site",
                                  "sigma_site","alpha_0","alpha_wind","alpha_forest"),
             model.file=qs_mod,
             # initialize N at or above the largest count observed at each site
             inits=list(list(N=apply(count,1,max)+1),
                        list(N=apply(count,1,max)+1),
                        list(N=apply(count,1,max)+1)),
             n.chains=3,
             n.iter=20000,
             n.burnin=5000,
             n.thin=10)
## module glm loaded
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 801
## Unobserved stochastic nodes: 1343
## Total graph size: 7492
##
## Initializing model
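The inits above (N set to one more than each site's largest count) guard against exactly the Node inconsistent with parents error: the binomial likelihood is exactly zero whenever an observed count exceeds the trial size N, so an initial N below a site's maximum count makes that node impossible. A minimal illustration:

```r
# dbinom returns 0 when the observed count exceeds the number of trials,
# so an initial N below a site's maximum count gives zero likelihood
dbinom(5, size = 3, prob = 0.5)   # 0: five detections out of three quail is impossible
dbinom(5, size = 6, prob = 0.5)   # positive: a valid configuration
```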
qs.check<-jags(data=jags_data_qs,
               parameters.to.save=c("count.new","p"),
               model.file=qs_mod,
               inits=list(list(N=apply(count,1,max)+1),
                          list(N=apply(count,1,max)+1),
                          list(N=apply(count,1,max)+1)),
               n.chains=3,
               n.iter=20000,
               n.burnin=5000,
               n.thin=10)
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 801
## Unobserved stochastic nodes: 1343
## Total graph size: 7492
##
## Initializing model
MCMCsummary(qs.fit)
## mean sd 2.5% 50%
## alpha_0 -0.89359679 0.11327924 -1.12064343 -8.930820e-01
## alpha_forest 0.01305130 0.16897255 -0.29803828 9.500327e-03
## alpha_wind -2.98527991 0.13919744 -3.24938669 -2.989302e+00
## b0 0.62427944 0.08908459 0.44919141 6.251070e-01
## b_elev -2.05852386 0.11907291 -2.29740959 -2.058708e+00
## b_forest 2.14059057 0.12068901 1.90594868 2.143818e+00
## b_int 1.29513484 0.16953891 0.96008017 1.295504e+00
## deviance 1323.42152541 28.44176862 1269.67480612 1.322952e+03
## sigma_site 0.07152209 0.03542327 0.02278854 6.483815e-02
## tau_site 425.04040092 522.73293794 41.02532164 2.378695e+02
## 97.5% Rhat n.eff
## alpha_0 -0.6728587 1.01 180
## alpha_forest 0.3376666 1.00 226
## alpha_wind -2.7015073 1.01 238
## b0 0.7980839 1.01 587
## b_elev -1.8250669 1.00 960
## b_forest 2.3740167 1.00 665
## b_int 1.6244017 1.00 1043
## deviance 1381.8100583 1.02 216
## sigma_site 0.1561256 1.00 1422
## tau_site 1925.6471315 1.00 2805
# Most Rhat values are 1 or very close to 1, so the model has converged.
# However, some count.new nodes have Rhat = NaN; these are simulated replicate-data nodes rather than model parameters, so this is less concerning, though worth noting.
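Rather than eyeballing the table, the Rhat check can be automated. `flag_rhat` below is a hypothetical helper, demonstrated on a toy data frame shaped like `MCMCsummary` output rather than on the fitted object itself:

```r
# Hypothetical helper: flag parameters whose Rhat exceeds a threshold,
# given an MCMCsummary-style table (parameters as rownames, "Rhat" column)
flag_rhat <- function(summ, threshold = 1.1) {
  rownames(summ)[summ[, "Rhat"] > threshold]
}

# Toy table standing in for MCMCsummary(qs.fit)
toy <- data.frame(Rhat = c(1.00, 1.01, 1.25),
                  row.names = c("b0", "b_elev", "bad_param"))
flag_rhat(toy)   # "bad_param"
```

In practice you would pass `MCMCsummary(qs.fit)` to `flag_rhat` and expect an empty result, given the values in the table above.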
traceplot(qs.fit)
# As last week, we only calculate the PPO (prior-posterior overlap) for the fixed effects, excluding the random effect.
pr_norm=rnorm(15000,mean=0,sd=sqrt(1/0.00001)) # draws from the vague prior; MCMCtrace uses only the last 4500
MCMCtrace(qs.fit,params="b0",priors=pr_norm,pdf=FALSE)
## Warning in MCMCtrace(qs.fit, params = "b0", priors = pr_norm, pdf = FALSE):
## Number of samples in prior is greater than number of total or specified
## iterations (for all chains) for specified parameter. Only last 4500 iterations
## will be used.
MCMCtrace(qs.fit,params="b_elev",priors=pr_norm,pdf=FALSE)
## Warning in MCMCtrace(qs.fit, params = "b_elev", priors = pr_norm, pdf = FALSE):
## Number of samples in prior is greater than number of total or specified
## iterations (for all chains) for specified parameter. Only last 4500 iterations
## will be used.
MCMCtrace(qs.fit,params="b_int",priors=pr_norm,pdf=FALSE)
## Warning in MCMCtrace(qs.fit, params = "b_int", priors = pr_norm, pdf = FALSE):
## Number of samples in prior is greater than number of total or specified
## iterations (for all chains) for specified parameter. Only last 4500 iterations
## will be used.
MCMCtrace(qs.fit,params="b_forest",priors=pr_norm,pdf=FALSE)
## Warning in MCMCtrace(qs.fit, params = "b_forest", priors = pr_norm, pdf =
## FALSE): Number of samples in prior is greater than number of total or specified
## iterations (for all chains) for specified parameter. Only last 4500 iterations
## will be used.
MCMCtrace(qs.fit,params="alpha_0",priors=pr_norm,pdf=FALSE)
## Warning in MCMCtrace(qs.fit, params = "alpha_0", priors = pr_norm, pdf =
## FALSE): Number of samples in prior is greater than number of total or specified
## iterations (for all chains) for specified parameter. Only last 4500 iterations
## will be used.
MCMCtrace(qs.fit,params="alpha_wind",priors=pr_norm,pdf=FALSE)
## Warning in MCMCtrace(qs.fit, params = "alpha_wind", priors = pr_norm, pdf =
## FALSE): Number of samples in prior is greater than number of total or specified
## iterations (for all chains) for specified parameter. Only last 4500 iterations
## will be used.
MCMCtrace(qs.fit,params="alpha_forest",priors=pr_norm,pdf=FALSE)
## Warning in MCMCtrace(qs.fit, params = "alpha_forest", priors = pr_norm, :
## Number of samples in prior is greater than number of total or specified
## iterations (for all chains) for specified parameter. Only last 4500 iterations
## will be used.
# The MCMC is much more efficient for the abundance parameters than for the detection parameters: the abundance traces look like hairy caterpillars, while the detection traces are more diffuse. One likely reason is that the detection parameters are highly correlated with the latent N, and hence with the abundance parameters, at each MCMC step. On top of the autocorrelation inherent to any MCMC chain, the detection parameters therefore inherit autocorrelation from the abundance side of the model, so their chains mix less efficiently.
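The mixing difference can be quantified with an effective-sample-size (ESS) calculation. The sketch below uses simulated chains rather than the quail posteriors, and a crude ESS estimator built from the sample autocorrelation function:

```r
# An autocorrelated AR(1) chain carries far fewer "effective" draws than
# an equally long well-mixed (independent) chain
set.seed(7)
n     <- 5000
indep <- rnorm(n)                                   # well-mixed chain
ar1   <- as.numeric(arima.sim(list(ar = 0.95), n))  # sticky, autocorrelated chain

# Crude ESS: n / (1 + 2 * sum of positive-lag autocorrelations)
ess <- function(x, max_lag = 100) {
  rho <- acf(x, lag.max = max_lag, plot = FALSE)$acf[-1]
  n / (1 + 2 * sum(rho[rho > 0]))
}
ess(indep)   # large: most draws are informative
ess(ar1)     # much smaller: sticky chain
```

This is the same quantity `MCMCsummary` reports as n.eff, which is why the detection parameters above show smaller n.eff than the abundance parameters despite identical chain lengths.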
MCMCsummary(qs.fit)
## mean sd 2.5% 50%
## alpha_0 -0.89359679 0.11327924 -1.12064343 -8.930820e-01
## alpha_forest 0.01305130 0.16897255 -0.29803828 9.500327e-03
## alpha_wind -2.98527991 0.13919744 -3.24938669 -2.989302e+00
## b0 0.62427944 0.08908459 0.44919141 6.251070e-01
## b_elev -2.05852386 0.11907291 -2.29740959 -2.058708e+00
## b_forest 2.14059057 0.12068901 1.90594868 2.143818e+00
## b_int 1.29513484 0.16953891 0.96008017 1.295504e+00
## deviance 1323.42152541 28.44176862 1269.67480612 1.322952e+03
## sigma_site 0.07152209 0.03542327 0.02278854 6.483815e-02
## tau_site 425.04040092 522.73293794 41.02532164 2.378695e+02
## 97.5% Rhat n.eff
## alpha_0 -0.6728587 1.01 180
## alpha_forest 0.3376666 1.00 226
## alpha_wind -2.7015073 1.01 238
## b0 0.7980839 1.01 587
## b_elev -1.8250669 1.00 960
## b_forest 2.3740167 1.00 665
## b_int 1.6244017 1.00 1043
## deviance 1381.8100583 1.02 216
## sigma_site 0.1561256 1.00 1422
## tau_site 1925.6471315 1.00 2805
#The mean of b0 is 0.624, meaning that the expected count at an average site, without considering the other effects, is e^0.624.
#Elevation has a strong, highly certain negative effect on quail abundance. The mean of b_elev is -2.059, so when elevation increases by 1, the expected count at a site changes by a factor of e^-2.059.
#Both forest cover and its interaction with elevation have positive effects on quail abundance, and we are very confident in these effects. When forest cover increases by 1 unit, expected abundance at a site is multiplied by e^2.14 (at average elevation), and the positive interaction (b_int = 1.30, factor e^1.30) means the forest effect strengthens as elevation increases.
#sigma_site is positive but relatively low (0.0715), suggesting that most of the variation is explained by the fixed effects and that sites contribute little unexplained variation in quail counts.
#The mean of alpha_0 is -0.894, so the baseline odds (not probability) of successful detection, without considering the other effects, are e^-0.894.
#alpha_wind has a very strong, highly certain negative effect on detection. Its mean of -2.99 means the odds of successful detection change by a factor of e^-2.99 when wind increases by 1 unit. Thus we are highly confident that wind hampers our ability to detect quail.
#The 95% credible interval for alpha_forest overlaps 0, so we are uncertain whether it has any real effect on detection; moreover, its mean is small (0.0131). It is therefore safe to say that canopy cover does not contribute much to the detection process for quail.
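The back-transformations used in the comments above can be computed directly. The numbers below are the posterior means copied from the `MCMCsummary` table (renamed with a `_hat` suffix to avoid clobbering the chain objects used later):

```r
# Posterior means from the MCMCsummary table above
b0_hat         <- 0.624
b_elev_hat     <- -2.059
alpha_0_hat    <- -0.894
alpha_wind_hat <- -2.985

exp(b0_hat)          # ~1.87: expected abundance at an average site (covariates = 0)
exp(b_elev_hat)      # ~0.13: abundance multiplier per 1-unit increase in elevation
plogis(alpha_0_hat)  # ~0.29: baseline detection probability (wind = 0, forest = 0)
exp(alpha_wind_hat)  # ~0.05: detection-odds multiplier per 1-unit increase in wind
```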
Last week you computed posterior predictive checks using two test
statistics, mean(y) and sd(y). Remember
that the model fit well for the mean but not the sd. Repeat this now,
with our N-mixture model, by calculating the observed and posterior
distributions for both test statistics, and plot each (make sure your
plots show both observed lines and the full posterior). Calculate the
Bayesian p-value for each as well.
ppd=MCMCchains(qs.check,params="count.new")
ppd_mean=apply(ppd, 1, mean)
ppd_sd=apply(ppd, 1, sd)
obs_mean=mean(count)
obs_sd=sd(count)
df=data.frame(mean=ppd_mean,sd=ppd_sd)
hist(ppd_mean,breaks=20,main="Posterior Prediction Check for Mean",xlab="Simulated Means",ylab="Frequency",xlim=c(min(ppd_mean,obs_mean)-1,max(ppd_mean,obs_mean)+1))
abline(v=obs_mean,col="red",lwd=2,lty=2)
legend("topright",legend ="Observed mean",col="red",lty=2,lwd=2)
hist(ppd_sd,breaks=20,main="Posterior Prediction Check for SD",xlab="Simulated Sd",ylab="Frequency",xlim=c(min(ppd_sd,obs_sd)-1,max(ppd_sd,obs_sd)+1))
abline(v=obs_sd,col="red",lwd=2,lty=2)
legend("topright",legend ="Observed Standard Deviation",col="red",lty=2,lwd=2)
p_mean=mean(ppd_mean>obs_mean)
p_sd=mean(ppd_sd>obs_sd)
p_mean
## [1] 0.4791111
p_sd
## [1] 0.6746667
#The Bayesian p-values are 0.479 for the mean and 0.675 for the sd.
#Both are far from 0 and 1, so unlike last week's model, this model now adequately reproduces both the mean and the standard deviation of the observed data.
ppd=MCMCchains(qs.check,params="count.new")
b0=MCMCchains(qs.fit,params="b0")
b_elev=MCMCchains(qs.fit,params="b_elev")
b_forest=MCMCchains(qs.fit,params="b_forest")
b_int=MCMCchains(qs.fit,params="b_int")
forest_cov_pred=seq(min(forest),max(forest),by=0.01)
elev_discrete=c(-1,0,1)
plot_list <- list()
for (i in elev_discrete){
  lower_ppd=c()
  upper_ppd=c()
  elev_val_used=c()
  prediction=c()
  for (j in 1:length(forest_cov_pred)){
    forest_cov_val=forest_cov_pred[j]
    # posterior of expected abundance (renamed so we don't overwrite the count data)
    lam_pred=exp(b0+b_elev*i+b_forest*forest_cov_val+b_int*i*forest_cov_val)
    lower_ppd[j]=quantile(lam_pred,0.055)   # 89% credible interval
    upper_ppd[j]=quantile(lam_pred,0.945)
    elev_val_used[j]=i
    prediction[j]=median(lam_pred)
  }
  plot_list[[as.character(i)]] <- data.frame(
    forest_cover=forest_cov_pred,
    median_pred=prediction,
    lower=lower_ppd,
    upper=upper_ppd,
    elevation=elev_val_used)
}
ggplot(plot_list[["-1"]],aes(x=forest_cover))+
  geom_ribbon(aes(ymin = lower, ymax = upper,col="89% CI"), fill="red",alpha=0.3)+
  geom_line(aes(y = median_pred, col="Median"), linewidth = 1)+
  labs(title = "Predicted Quail Abundance with 89% Credible Interval (Elevation=-1)",x="Forest Cover",y="Predicted Abundance",col="Key")
ggplot(plot_list[["0"]],aes(x=forest_cover))+
  geom_ribbon(aes(ymin = lower, ymax = upper,col="89% CI"), fill="red",alpha=0.3)+
  geom_line(aes(y = median_pred, col="Median"), linewidth = 1)+
  labs(title = "Predicted Quail Abundance with 89% Credible Interval (Elevation=0)",x="Forest Cover",y="Predicted Abundance",col="Key")
ggplot(plot_list[["1"]],aes(x=forest_cover))+
  geom_ribbon(aes(ymin = lower, ymax = upper,col="89% CI"), fill="red",alpha=0.3)+
  geom_line(aes(y = median_pred, col="Median"), linewidth = 1)+
  labs(title = "Predicted Quail Abundance with 89% Credible Interval (Elevation=1)",x ="Forest Cover",y = "Predicted Abundance",col="Key")
#The plots have similar shapes to last week's, which is reasonable: the biological relationships between elevation, forest cover, and abundance are modeled the same way in both analyses, so the curves should look alike.
#However, these plots predict substantially more quail per site than PS6 did. That is because we now account for the detection process: the true number of quail at a site must be larger than the number actually observed, whereas PS6 implicitly assumed detection was perfect. Hence the predicted abundances here are much higher than in PS6.
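A quick sanity check on the scale difference: the expected count at a site is \(\lambda \times p\), so at the baseline detection probability implied by alpha_0 (posterior mean -0.894 from the summary above), the latent abundance must be several times the expected observed count:

```r
# Baseline detection probability from the fitted model (posterior mean of alpha_0)
p_hat     <- plogis(-0.894)   # ~0.29 at wind = 0, forest = 0
inflation <- 1 / p_hat        # ratio of true abundance to expected observed count
round(inflation, 2)           # ~3.44
```

So even before wind lowers detection further, the abundance curves should sit roughly three to four times higher than last week's count-based predictions, which is what the plots show.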
a0=MCMCchains(qs.fit,params="alpha_0")
a_forest=MCMCchains(qs.fit,params="alpha_forest")
a_wind=MCMCchains(qs.fit,params="alpha_wind")
p_val_ppd=MCMCchains(qs.check,params="p")
wind_cov_pred=seq(min(wind),max(wind),by=0.01)
forest_discrete=c(-1,0,1)
plot_list <- list()
for (i in forest_discrete){
  lower_ppd=c()
  upper_ppd=c()
  forest_val_used=c()
  prediction=c()
  for (j in 1:length(wind_cov_pred)){
    wind_val=wind_cov_pred[j]
    # inverse-logit back-transform to the probability scale
    p_val=plogis(a0+a_forest*i+a_wind*wind_val)
    lower_ppd[j]=quantile(p_val,0.055)   # 89% credible interval
    upper_ppd[j]=quantile(p_val,0.945)
    forest_val_used[j]=i
    prediction[j]=median(p_val)
  }
  plot_list[[as.character(i)]] <- data.frame(
    wind_val_plot=wind_cov_pred,
    median_pred=prediction,
    lower=lower_ppd,
    upper=upper_ppd,
    forest_val_plot=forest_val_used)
}
ggplot(plot_list[["-1"]],aes(x=wind_val_plot))+
  geom_ribbon(aes(ymin = lower, ymax = upper,col="89% CI"), fill="red",alpha=0.3)+
  geom_line(aes(y = median_pred, col="Median"), linewidth = 1)+
  labs(title = "Predicted Detection Probability with 89% Credible Interval (Forest Cover=-1)",x="Wind",y="Probability of observation",col="Key")
ggplot(plot_list[["0"]],aes(x=wind_val_plot))+
  geom_ribbon(aes(ymin = lower, ymax = upper,col="89% CI"), fill="red",alpha=0.3)+
  geom_line(aes(y = median_pred, col="Median"), linewidth = 1)+
  labs(title = "Predicted Detection Probability with 89% Credible Interval (Forest Cover=0)",x="Wind",y="Probability of observation",col="Key")
ggplot(plot_list[["1"]],aes(x=wind_val_plot))+
  geom_ribbon(aes(ymin = lower, ymax = upper,col="89% CI"), fill="red",alpha=0.3)+
  geom_line(aes(y = median_pred, col="Median"), linewidth = 1)+
  labs(title = "Predicted Detection Probability with 89% Credible Interval (Forest Cover=1)",x="Wind",y="Probability of observation",col="Key")
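One closing sketch: closure plus repeat visits is what rescues detection. With the baseline per-visit detection probability implied by alpha_0 (posterior mean -0.894), the chance of detecting a given individual at least once across the three visits is much higher than on any single visit:

```r
# Cumulative detection over k independent visits: p* = 1 - (1 - p)^k
p_visit <- plogis(-0.894)        # baseline per-visit detection (~0.29)
p_star  <- 1 - (1 - p_visit)^3   # detected on at least one of three visits
round(p_star, 2)                 # ~0.64
```

This is the payoff of the repeat-visit design: even with modest per-visit detection, three visits give a reasonable overall chance of registering each quail, which is what lets the model separate abundance from detection.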